Skip to content

fix(update_service): fixes TaskDefinition inactive exception#135

Open
rokibulislaam wants to merge 3 commits intoGetSimpl:masterfrom
rokibulislaam:fix/TaskDefinition-inactive-error
Open

fix(update_service): fixes TaskDefinition inactive exception#135
rokibulislaam wants to merge 3 commits intoGetSimpl:masterfrom
rokibulislaam:fix/TaskDefinition-inactive-error

Conversation

@rokibulislaam
Copy link
Collaborator

@rokibulislaam rokibulislaam commented May 4, 2023

  • Added a new tag in task definition "task_definition_source" to identify if task definition was created with CloudFormation or with boto
  • Included a preflight step to verify whether the existing deployment is dirty, and if it is, whether an image with the tag 'master' is available on ECR.

More context:

Whenever we execute the create_service command, a CloudFormation template is generated. Within this template, in the ECS Task definition section, we set the default Docker image URL with the master tag if the current git status is dirty or if the work tree is not clean. Additionally, we keep the DesiredCount as zero since, at this stage, we don't have an image uploaded to ECR. If we were to proceed with deployment, it would fail. Consequently, at this point, CloudFormation creates a task definition with the revision Family:1.

On the other hand, when we execute the deploy_service command, we build a Docker image and upload it to ECR with a tag based on the current git status or a specific version provided using the --version option. If the work tree is dirty, we append the dirty tag to the Docker image and push it to ECR. Next, we manually create a new task definition using boto3, utilizing the Docker image URI we just deployed with the corresponding image tag. We then increase the desired count from 0 to 1, prompting ECS to start the task. At this stage, boto3 creates a new task definition Family:2, while also deactivating Family:1 using boto3.

What's the issue?

The problem arises when we execute the update_service command, as we rely on CloudFormation to make the necessary changes. We regenerate the CloudFormation template and request CloudFormation to create a changeset. CloudFormation compares the old and new templates, resulting in the creation of a changeset. Now, two scenarios can occur:

  1. If we make changes to the cloudlift service configuration that affect the task definition, such as updating the memory_reservation which modifies memoryReservation in the ECS task definition. In this scenario, CloudFormation creates a new task definition with a new revision, Family:3. However, in the new task definition, it sets the new image URI with the master tag. The CloudFormation stack update is successful, but since we had a dirty worktree, we only have one image deployed on ECR with the dirty tag. As a result, ECS attempts to run the task but continually fails since the image with the master tag does not exist. This is where the preflight check becomes relevant.

  2. If we make changes to the cloudlift configuration that do not affect the task definition, such as updating the ELB/ALB health check path with health_check_path, which alters the Load Balancer and target group. Once again, we create a changeset and execute it. However, CloudFormation attempts to reference the last task definition created by itself, which in our case is Family:1. Nonetheless, we manually deactivated this task definition using boto3. Consequently, the CloudFormation stack update fails with a client exception: TaskDefinition is inactive, Status Code: 400. To address this, we are introducing a new tag (task_definition_source) to the task definition to identify whether the previous task definition was created by CloudFormation or boto3. If it was created by CloudFormation, we do not deregister the task definition.

Point to remember is CloudFormation always attempts to reference the task definition it created, regardless of the version number. Even if you manually created a new task definition, such as Family:999, and updated the service to run from Family:999, CloudFormation would still try to reference the version it created during its last operation. Could this be a bug in CloudFormation or is it designed to behave so?

- Added a new tag in task definition "task_definition_source" to identify if task definition was created with cloudformation or with boto
- Included a preflight step to verify whether the existing deployment is dirty, and if it is, whether an image with the tag 'master' is available on ECR.
@praveenraghav01 praveenraghav01 self-requested a review May 4, 2023 08:32
@rokibulislaam
Copy link
Collaborator Author

rokibulislaam commented May 29, 2023

One crucial case it's not covering is, if someone tries to update only the properties which don't affect TD during update_service for example, updating only the ELB health check path. But this pre-flight check is not letting it go beyond if an image with master tag is not pushed to ECR (if the current deployment is dirty) irrespective of the changes like an example discussed earlier.

Need to cover this, what's your opinion @praveenraghav01 ?

@rokibulislaam rokibulislaam self-assigned this May 29, 2023
@praveenraghav01
Copy link
Contributor

I will suggest that we only do this if the changes include creation of a new Service. For rest we can just ignore image check

repo_name = service_name + '-repo'
res = ecr_client.batch_get_image(repositoryName=repo_name, imageIds=[{'imageTag': 'master'}])
if res['images'] == []:
raise UnrecoverableException("Current deployment is dirty. Please push an image tagged as 'master' to ECR.") No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, we should just bail stating that the current deployment is dirty and that an update cannot be made.

@@ -164,4 +169,4 @@ def _print_progress(self):
if "FAIL" in final_status:
log_err("Finished with status: %s" % (final_status))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should emit a non-zero exit code

repo_name = service_name + '-repo'
res = ecr_client.batch_get_image(repositoryName=repo_name, imageIds=[{'imageTag': 'master'}])
if res['images'] == []:
raise UnrecoverableException("Current deployment is dirty. Please push an image tagged as 'master' to ECR.") No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add cloudlift specific command to push to ecr. This can be done with upload_to_ecr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants